Dalek NEON v7 #691
Conversation
Co-authored-by: pinkforest <[email protected]> Co-authored-by: Robrecht Blancquaert <[email protected]>
…5x4::shuffle Co-authored-by: Robrecht Blacquaert <[email protected]>
Hey hey, thanks for submitting this! I got around to looking at this and left some notes for you and myself.
First, this looks overall pretty great. There are just a few minor compile-time warnings that need to be resolved. I won't paste them here because you can see them yourself.
On that note, it'd be nice to get a CI test case for this backend. I can look into that.
Finally, this seems really slow on my MacBook Air M1. Am I doing something wrong? I ran with backend=serial, then backend=simd, and saw a 45-50% slowdown across the board. Here's a partial paste of the benchmarks.
edwards benches/EdwardsPoint compression
time: [3.3320 µs 3.3329 µs 3.3339 µs]
change: [+1.1223% +1.3350% +1.5407%] (p = 0.00 < 0.05)
Performance has regressed.
edwards benches/EdwardsPoint decompression
time: [3.5899 µs 3.5936 µs 3.5978 µs]
change: [+0.7537% +1.2902% +1.7433%] (p = 0.00 < 0.05)
Change within noise threshold.
edwards benches/Constant-time fixed-base scalar mul
time: [9.2123 µs 9.2148 µs 9.2174 µs]
change: [-6.0098% -5.4511% -4.8764%] (p = 0.00 < 0.05)
Performance has improved.
edwards benches/Constant-time variable-base scalar mul
time: [50.151 µs 50.160 µs 50.171 µs]
change: [+34.156% +42.306% +49.662%] (p = 0.00 < 0.05)
Performance has regressed.
edwards benches/Variable-time aA+bB, A variable, B fixed
time: [48.128 µs 48.142 µs 48.157 µs]
change: [+62.406% +62.559% +62.715%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/1
time: [50.362 µs 50.376 µs 50.392 µs]
change: [+48.626% +50.038% +50.915%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/2
time: [66.040 µs 66.063 µs 66.091 µs]
change: [+49.459% +49.626% +49.793%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/4
time: [96.888 µs 96.919 µs 96.954 µs]
change: [+47.723% +47.968% +48.186%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/8
time: [159.00 µs 159.03 µs 159.07 µs]
change: [+46.456% +46.623% +46.774%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/16
time: [283.98 µs 284.12 µs 284.30 µs]
change: [+46.400% +46.584% +46.797%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/32
time: [533.07 µs 533.21 µs 533.36 µs]
change: [+45.954% +46.096% +46.227%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/64
time: [1.0317 ms 1.0329 ms 1.0349 ms]
change: [+46.202% +47.030% +48.196%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/128
time: [2.0357 ms 2.0364 ms 2.0372 ms]
change: [+44.702% +44.773% +44.844%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/256
time: [4.0719 ms 4.1737 ms 4.3095 ms]
change: [+46.218% +49.911% +54.422%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/384
time: [6.0426 ms 6.0450 ms 6.0474 ms]
change: [+45.247% +45.313% +45.378%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/512
time: [8.0369 ms 8.0400 ms 8.0433 ms]
change: [+44.325% +45.007% +45.447%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/768
time: [12.068 ms 12.079 ms 12.093 ms]
change: [+45.656% +45.797% +45.974%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/1024
time: [16.070 ms 16.080 ms 16.090 ms]
change: [+45.287% +45.574% +45.767%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/1
time: [43.677 µs 43.699 µs 43.724 µs]
change: [+61.865% +62.013% +62.168%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/2
time: [53.354 µs 53.396 µs 53.451 µs]
change: [+60.683% +60.870% +61.053%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/4
time: [72.161 µs 72.232 µs 72.358 µs]
change: [+58.561% +58.725% +58.914%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/8
time: [110.14 µs 110.18 µs 110.21 µs]
change: [+56.365% +56.537% +56.744%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/16
time: [186.40 µs 186.46 µs 186.52 µs]
change: [+54.205% +54.357% +54.516%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/32
time: [339.54 µs 339.68 µs 339.84 µs]
change: [+52.412% +53.057% +53.697%] (p = 0.00 < 0.05)
Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/64
time: [653.32 µs 654.65 µs 655.88 µs]
change: [+51.725% +52.242% +52.851%] (p = 0.00 < 0.05)
Performance has regressed.
fn eq(&self, rhs: &$ty) -> bool {
    unsafe {
        let m = neon::$beq_intrinsic(self.0, rhs.0);
        Self(m).extract::<0>() != 0
Why can you extract just the first element to check equality? The docs say that the vectors are equal iff _every_ bit in the output vector is set to 1.
impl u32x4 {
    #[inline]
    pub fn new(x0: u32, x1: u32, x2: u32, x3: u32) -> Self {
        unsafe { core::mem::transmute::<[u32; 4], Self>([x0, x1, x2, x3]) }
Should include a safety note like the one here:
curve25519-dalek/curve25519-dalek/src/backend/vector/packed_simd.rs, lines 241 to 242 in 4570d80:
// SAFETY: Transmuting between an array and a SIMD type is safe
// https://rust-lang.github.io/unsafe-code-guidelines/layout/packed-simd-vectors.html
Also, out of curiosity, why the transmute and not a vld* instruction? E.g. the x86 code says this set instruction is faster:
u64x4(core::arch::x86_64::_mm256_set_epi64x(
If you're gonna transmute for everything, you may as well just call `const_new` from here instead of duplicating the code. Actually, even better: since they're all the same anyway, just call them `new` and `splat` and remove the `const_*` names entirely.
        assert_eq!(base_splits[3], b_splits[3]);
    }
}
Self: File looks good. Diff between this and the AVX2 edwards impl is entirely aesthetic
let (b8, b9) = unpack_pair(self.0[4]);

FieldElement2625x4::reduce64([
    u64x2x2::new(vmull_u32(b0.0.0, consts.0.into()).into(), vmull_u32(b0.0.1, consts.1.into()).into()),
It seems like this mul32 pattern comes up. Might make sense to make it its own function like in the AVX2 version?
        assert_eq!(x2, splits[2]);
        assert_eq!(x3, splits[3]);
    }
}
Self: Ran this file through Difftastic and it seems mostly the same as the AVX2 version. Need to understand the blending functions better, though. Also the rotation code.
            u32x4::const_new(44524351, 50428429, 21904953, 12608048),
        ),
    ])),
]);
Self: seems fine. Spot-checked by perturbing constants in random places and seeing that tests failed. One place it didn't fail: if you change the shift in P_TIMES_16_LO to 3 instead of 4, everything still works. That's probably due to some approximation algorithm still succeeding.
Continuation of #457 for v7. @Tarinn will continue this while I patch some things in Rust and LLVM...